DOCUMENT RESUME

ED 378 243                                        TM 022 628

AUTHOR        Walberg, Herbert J.; And Others
TITLE         Assessment Reform: Challenges and Opportunities.
              Fastback 377.
INSTITUTION   Phi Delta Kappa Educational Foundation, Bloomington,
              Ind.
REPORT NO     ISBN-0-87367-377-8
PUB DATE      94
NOTE          43p.
PUB TYPE      Reports - Evaluative/Feasibility (142)

EDRS PRICE    MF01/PC02 Plus Postage.
DESCRIPTORS   *Academic Achievement; *Accountability; *Cognitive
              Psychology; Cost Effectiveness; *Educational
              Assessment; Educational Change; Educational
              Innovation; Educational Policy; Educational
              Technology; Elementary Secondary Education;
              Intelligence Tests; School Restructuring; Standards;
              Test Construction; *Test Use
IDENTIFIERS   High Stakes Tests; *Reform Efforts



ABSTRACT 

This fastback reference analyzes contrasting opinions
about educational assessment and testing in the light of available
evidence. The reform of student assessment is an essential component
of the revitalization of American schools. Accountability issues
relate to the proliferation of testing and the increasing use of
high-stakes tests for policy decisions. A new focus on cognitive
psychology has stimulated innovations in assessment practices. While
cognitivists may attempt to go beyond behaviorally developed tests,
they have yet to produce convincing and practical methods that can be
easily used in classrooms. Technological developments are making
tests easier to develop, administer, and score, but critical economic
and technological barriers must be overcome before technology
fulfills its promise in assessment. As the adequacy of current
assessments is considered, three areas of debate arise: purposes of
assessment, standards of technical quality, and cost. These
considerations are equally important in the development of
alternative assessments. Alternative assessments promise a great deal
yet require sober evaluation. One figure illustrates a developed test
item. (Contains 25 references.) (SLD)

Traditional vs. Alternative Assessments



New forms of assessment result from education reforms, developments
in psychology, and advances in testing technology. Much of
the debate about assessment reform has relied on opinion, not fact.
This fastback analyzes contrasting opinions in the light of available
evidence.

In this fastback the term "traditional tests" refers to standardized,
norm-referenced, multiple-choice achievement tests administered
using a paper-pencil format under standardized conditions. These tests
are used to measure individual student performance so that students'
scores can be compared. During the past decade such traditional tests
have been challenged.

The term "alternative assessments" means all assessments other than
traditional tests. Alternative assessments include essays, portfolios,
interviews, simulations, projects, and performances. Many alternative
assessments, such as assigned written compositions and portfolios
of artwork, have been used for decades; thus some of them are
neither new nor untested.

The term "authentic assessments" refers to assessments, especially
performance assessments, that purportedly measure valuable,
real-world, complex tasks. Authentic assessments often are contrasted with
traditional tests and promoted as being a significant improvement over
the limitations of traditional tests. In this fastback, the term
"authentic assessments" is rarely used. The term "authentic" is a rhetorical
device that suggests that traditional assessments are inauthentic or do
not measure important knowledge or skills. The claims of the authentic
assessment advocates have yet to be proved.



Influence of Education Reform 



During the past decade, American public schools have been in the
throes of reform. Apparent poor student performance on basic skills
and knowledge tests, low levels of achievement for U.S. students
compared to their international counterparts, and low rates of adult
literacy have caused educators, as well as the general public, to call for
reforms. The reform of student assessment is an essential component
of the revitalization of American schools.

Accountability 

The public values assessment data as a means to evaluate students,
school systems, reform efforts, and the standing of U.S. students
compared to students in other nations. Since accountability places
responsibility for the success of the students on their teachers, it has become
a central feature of education reform. Some reformers believe that
the education system will improve only if teachers are held
accountable for their students' test performance, because assessment data are
the best evidence that schools are reforming. Adequate levels of
achievement should be defined in terms of minimal standards, as well
as comparative standards of progress among students. Thus some
reformers argue that the quality of education will be improved only
by establishing high standards of achievement and holding teachers
responsible for ensuring that their students meet those standards.




Proliferation of Testing 

Elementary and secondary students take an estimated 127 million
separate standardized tests each year as a result of district and state
mandates (National Commission on Testing and Public Policy 1990).
During 1986-1987, approximately 105 million standardized tests were
administered to 39.8 million public school students. Of these, more
than 55 million were tests of achievement, competency, and basic
skills administered to students in compensatory and special
education programs. Some two million tests were used to screen
prekindergarten and kindergarten students, and 41 million tests were
administered in regular classrooms in grades 1 to 12. The General
Education Development testing program, the National Assessment of
Educational Progress (NAEP), and admission requirements for a
variety of colleges and secondary schools accounted for an additional
six million to seven million tests (Neill and Medina 1989). The
National Commission on Testing and Public Policy (1990) reported that
test revenues doubled between 1960 and 1966, and increased
fivefold between 1967 and 1980. The revenues increased from
approximately $40 million in 1960 to $100 million in 1989.



High-Stakes Testing

Whenever important consequences are attached to test results, it
is considered high-stakes testing. The Scholastic Aptitude Test (SAT)
and the American College Testing program (ACT) have always been
high-stakes tests for college-bound students, because receiving a poor
score may result in the test taker being denied admission to the
college of choice. School systems may suffer enrollment drops because
of the importance given to test scores by some community members.
Even the real estate market may be affected by the newspaper reports
of local test scores and the ranking of schools and districts according
to their test scores.




Media reporting of test scores has raised the stakes for schools and
students. Teachers feel pressured to improve test scores and to cover
tested material. Some districts use assessment scores to determine
merit pay and dismissal decisions. Increasing the stakes of tests for
teachers and administrators can exacerbate problems of overzealous
test preparation and teaching to the test. Darling-Hammond (1991)
listed the negative consequences of using test scores to make
decisions about rewards or sanctions for schools and teachers, including:

. . . designating large numbers of low-scoring students for
placement in special education so that their scores won't "count" in school
reports, retaining students in grade so that their relative standing will
look better on grade-equivalent scores, excluding low-scoring students
from admission to "open enrollment" schools, and encouraging
low-scoring students to drop out. (p. 223)

In many states test scores have risen in the first few years
following the introduction of a high-stakes testing program. Whether these
increased scores reflect real improvement in student achievement or
only gains specific to a particular test remains to be determined. Some
studies show that dropout rates increase in schools with competency
tests as a graduation requirement and test-based retention policies
(Madaus 1991). Students usually are motivated to do well on tests
if they see a relationship between their performance on these tests
and their grades or college and job prospects.

Emphasis on tests can produce desirable effects on curriculum,
teaching, and learning. High-stakes tests may serve to focus
instruction and highlight students' and teachers' goals. Some researchers
assert that a better match between what is taught and what is tested may
revitalize an obsolete curriculum.




Influence of Psychology 



Cognitive psychology challenges common views of learning,
teaching, and assessment. The shift from behaviorism to cognitive
psychology in the late 1950s initiated a new focus on how individuals
learn, think, and acquire and apply knowledge. This new focus
stimulated innovations in assessment practices.

Cognitive psychologists see learners as actively constructing knowl- 
edge structures that learners modify as their level of expertise rises. 
Behaviorists emphasize that higher-order understandings are the result 
of mastering discrete skills and prerequisite learnings. Thus be- 
haviorists see teachers as knowledge transmitters who directly influ- 
ence student learning, while cognitivists believe that teachers indirectly 
enhance student thought by asking questions, providing examples, 
giving instructions, and creating learning environments. 

Behaviorists believe that complex processes, such as reading
comprehension, can be broken down into a series of discrete skills. For
behaviorists, tests are constructed by specifying behavioral outcomes
that must be mastered for each instructional goal. In contrast,
cognitivists believe that assessments should measure a wide range of
thought, including knowledge, metacognitive processes, learning
errors, and affective thought processes. Cognitivists measure
knowledge by assessing the relationships among facts, principles,
procedures, and beliefs. They measure metacognitive skills that an
individual uses to appraise his or her own thinking, including the
ability to plan, activate, monitor, and evaluate actions. Affective thoughts
are measured through coping and self-regulatory skills.

Merlin Wittrock (1991) expresses the concerns of cognitive
psychologists:

Many standardized intelligence tests, achievement tests, and ability
tests . . . were not designed to measure diagnostically useful
cognitive and affective thought processes. . . . [T]hese tests do not
measure student conceptions, learning strategies, or metacognition or
affective thought processes relevant to instruction. (p. 3)

Despite these claims from cognitive psychologists, many
educators and other psychologists adhere to behaviorism or adopt an eclectic
approach. Behaviorism is evident in mastery learning,
computer-assisted instruction, and criterion-referenced testing. Cognitivists may
strive to go beyond behaviorally developed tests, but they have yet
to produce convincing, practical methods that can be used easily in
classrooms.



Measurement of School Achievement 

Four broad understandings have emerged from cognitive
psychology: 1) the description of subject matter in terms of declarative,
procedural, and prior knowledge; 2) the characterization of increases in
knowledge along a continuum from novice to expert performance;
3) the cataloguing of learning errors specific to subject areas; and
4) the identification of metacognitive processes and learning strategies
that individuals use to manage their own learning.

Declarative, Procedural, and Prior Knowledge. Students organize
knowledge into schemas that are unique to the subject matter.
Declarative knowledge is a network of facts and ideas. A student's ability
to retrieve information efficiently is directly related to the
organization of his or her declarative knowledge. According to psychologists,
achievement testing in a subject area should include both estimates
of the amount of declarative knowledge a student possesses and how
that knowledge is organized.

Traditional tests can provide a partial measurement of declarative
knowledge. Items that require the student to recognize the correct
answer can be used to assess the student's command of facts,
principles, and vocabulary. Such tests may be less able to measure the way
the student organizes information. Alternative item types, such as word
associations or semantic maps, are more suited to measuring the
organization of knowledge.

Procedural knowledge is knowledge of the processes and routines
used in thinking. As knowledge becomes proceduralized, it becomes
automatic and requires little attention by learners. The more quickly
a student completes a task, the more proceduralized the knowledge
and skills have become. There are no practical methods for teachers
to test procedural knowledge, except through experienced observation.

Prior knowledge refers to the knowledge and skills that a student
brings to the instructional setting. A student's idiosyncratic
knowledge structures include not only the knowledge and skills they have
acquired, but also their preconceptions, misconceptions, and beliefs.
Information about a student's prior knowledge is useful when
planning instruction.

The diagnosis of preconceptions, misconceptions, and beliefs can 
be accomplished through the use of constructed-response, alternative 
test items, as well as with multiple-choice items. Although facts and 
skills frequently are assessed, assessment of preconceptions, miscon- 
ceptions, and beliefs is rarely used. 

Novice and Expert Performance. Experts possess more complex
knowledge structures than novices and efficiently organize their
knowledge. They pay little attention to the surface characteristics of
problems and carefully monitor their own problem solving. Experts
generate rich problem representations as a guide for selecting
solutions. Assessing expertise is easiest in subjects such as mathematics,
in which the content is explicit and problem solving is well understood.



Several techniques used to document novice and expert differences
include transcripts of students' solutions to problems, semantic or
conceptual maps, and word associations. Semantic maps show
relationships among the words and concepts that students use. Word
associations involve generating word responses to a stimulus word.
These assessment methods are less suitable for classroom use than
for research because they require training the test administrators,
administering individual assessments, preparing transcripts, and
analyzing and scoring responses in detail.

Studies of expert performance can identify milestones that students
need to master en route to expert performance. These milestones can
serve as a blueprint for test specifications. Assessments of students'
subject-matter expertise should consider: 1) the level of detail used
to represent a problem, 2) the characteristics of the problem, 3) the
conceptual skills and principles used, 4) the degree of organization
and flexibility in reasoning, and 5) the selection and execution of
solution strategies.

Learning Errors. Individuals make a variety of errors when solving
problems in specific subjects. Some psychologists believe that errors
are rule-governed. Rule-governed errors are exemplified by the
systematic mistakes of elementary-age students when applying
subtraction algorithms or doing place-value arithmetic. Other types of learning
errors include naive theories and misconceptions. Naive theories are
common prescientific beliefs that individuals hold about natural
phenomena. For example, in astronomy, some students believe the
sun rotates around the earth. As individuals mature and increase their
knowledge of these phenomena, they shift toward more scientific
conceptions.

Researchers stress error identification because it can be helpful in
diagnosing learning difficulties and in developing remediation.
Learning errors are more easily identified in mathematics or the sciences.
In less-defined subject areas, such as the arts, compiling an
inventory of learning errors is difficult.

Learning errors cannot be diagnosed using traditional tests.
Researchers have developed alternative methods, including individual
clinical interviews, semantic maps, and verbal transcripts of
explanations. These individually administered assessments require large
expenditures of time and money; thus there has been some interest
in developing group measures of learning errors.

Metacognitive Processes and Learning Strategies. Metacognitive
processes involve the self-management of thinking. These processes
include planning, activating, monitoring, and evaluating one's actions.
Metacognitive skills can be specific to a subject area or they can be
general. Knowing that a particular strategy will enhance performance
and knowing how and under what conditions to apply the strategy
are metacognitive skills. Many reading programs, for example, now
are designed to teach metacognitive skills.

Weinstein and Meyer (1991) identified several types of learning
strategies, such as rehearsal, elaboration, and organization.
Rehearsal requires the simple repetition of items in order to secure them
in memory. Elaboration involves the addition of symbolic content,
such as mental imagery, to increase the meaningfulness of the
information to be learned. Elaboration facilitates the integration of
knowledge by increasing the relationships among information in a student's
knowledge structure. Organization transforms information into a
format that is easier to understand. The construction of a timeline is an
example of organization.

Comprehension monitoring is another metacognitive skill, which
involves establishing learning goals, assessing their accomplishment,
and modifying ineffective strategies. For instance, students may ask
themselves questions about information in order to discover
knowledge gaps. Affective strategies allow students to persist longer at
difficult learning tasks and feel more effective. For example, when
students schedule study sessions before an examination as a way to
relieve anxiety, they are using an affective strategy.

The assessment of metacognitive learning strategies cannot rely on
traditional tests. In research studies, students' metacognitive
learning is exposed through structured interviews, self-report measures,
observations, and occasional paper-pencil tests. In performance
assessments, students provide extended oral or written responses that
may reveal the metacognitive learning being used. In performance
assessments, teachers can examine an essay, a science experiment,
or a detailed written justification of a mathematics solution to gather
evidence of the metacognitive processes a student is applying.

Because learning and study strategies affect achievement, they must
be assessed separately from achievement itself. Traditional tests fail
to provide information on metacognition or study practices. The
Learning and Study Strategies Inventory developed by Weinstein,
Schulte, and Palmer (1987) attempts to remedy these deficiencies.
It contains 10 subscales, including attitude, motivation, time
management, anxiety, concentration, information processing, selecting main
ideas, study aids, self-testing, and test strategies. This inventory can
help teachers to design optimally effective teaching and learning
strategies for all students.

Technological Developments



Technological developments have lightened the work of
psychometrists and educators by making assessments easier to develop,
administer, and score. Computers can make assessment more efficient
as well as create new learning environments. However, for
technology to fulfill its promise, critical economic and technological barriers
must be surmounted.

Test Development and Scoring 

As computer capacity and speed have increased, computers have
become more widely used in all aspects of testing, including managing,
storing, and updating item banks. Item banks make it possible to
develop customized assessments. Test items can be stored electronically
by instructional objective, technical characteristics, and other
categories. CD-ROM technology is being developed to store longer
items for which students must construct or produce a response.
Computers also have been used to generate tests using laser printers, which
allow complicated drawings to be included.
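
The idea of an item bank indexed by instructional objective and technical characteristics can be illustrated with a short sketch. It is only an illustration under stated assumptions: the `Item` fields, the use of proportion correct as the difficulty statistic, the sample items, and the `assemble_test` helper are all hypothetical and do not describe any particular testing system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    item_id: str
    objective: str     # instructional objective the item targets
    difficulty: float  # assumed statistic: proportion answering correctly
    text: str


def assemble_test(bank, objective, min_difficulty, max_difficulty, length):
    """Return up to `length` items for one objective within a difficulty band."""
    matches = [
        item for item in bank
        if item.objective == objective
        and min_difficulty <= item.difficulty <= max_difficulty
    ]
    # Easiest items first (highest proportion correct) so the form ramps up.
    matches.sort(key=lambda item: item.difficulty, reverse=True)
    return matches[:length]


# Hypothetical bank contents, for illustration only.
BANK = [
    Item("F1", "fractions", 0.85, "1/2 + 1/4 = ?"),
    Item("F2", "fractions", 0.55, "3/4 - 2/3 = ?"),
    Item("F3", "fractions", 0.20, "Order 5/8, 7/12, and 2/3."),
    Item("D1", "decimals", 0.70, "0.3 + 0.45 = ?"),
]

form = assemble_test(BANK, "fractions", 0.30, 0.90, length=2)
print([item.item_id for item in form])  # prints ['F1', 'F2']
```

Storing the objective and the item statistics alongside each item is what lets a customized form be drawn with a simple filter rather than by hand.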

The technology of "mark-sense" (or scannable) answer sheets made
large-scale assessment much more feasible and made the printing of
thousands of answer booklets obsolete. Optical mark-reading
equipment can score more than [?],000 answer sheets in an hour.

Currently, computers can score free-response items by comparing
students' responses with keyword lists or previous answers that have
been sorted into correct, partially correct, and incorrect categories.
Computers also have been used to score students' writing. They can
provide general essay evaluations and specific suggestions for word
use and sentence construction. But one difficulty in the computer
scoring of essays is that the written works cannot be easily converted into
machine-readable form.
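
Keyword-based scoring of a free response, as described above, might look like the following minimal sketch. The keyword list, the `full_credit` threshold, and the three category labels are assumptions chosen for illustration; real scoring programs would need synonym handling, spelling tolerance, and validation against human scorers.

```python
import re


def score_response(response, keywords, full_credit):
    """Classify a free response by how many expected keywords it contains."""
    words = set(re.findall(r"[a-z']+", response.lower()))
    hits = sum(1 for keyword in keywords if keyword in words)
    if hits >= full_credit:
        return "correct"
    if hits > 0:
        return "partially correct"
    return "incorrect"


# Hypothetical keyword list for a question about energy in a food web.
KEYWORDS = ["sun", "plants", "energy"]

print(score_response("Plants get energy from the sun.", KEYWORDS, full_credit=3))
# prints: correct
print(score_response("The sun is hot.", KEYWORDS, full_credit=3))
# prints: partially correct
print(score_response("Mice eat cheese.", KEYWORDS, full_credit=3))
# prints: incorrect
```

The sketch also shows the limitation of the approach: it counts words, not understanding, so a response can contain every keyword and still be wrong.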

Computer software can select, order, and administer test items to
individual students at their convenience. These administrations usually
require microcomputers and may include the use of televisions, slides,
or audio recordings. Computer-based administrations do not require
students to record their answers on test booklets or answer sheets.
Rather, students use a keyboard, mouse, or touch-sensitive screen.
In the future, students should be able to write their answers on a
computer screen.

Computerized testing affords greater standardization of conditions,
because the computer can present identical screens to all students.
Students can take computerized tests at their own convenience and
pace in public libraries or at home, using modems with "dumb
terminals" or inexpensive personal computers. Computer-based assessment
makes it possible to administer individual mastery tests or
criterion-referenced examinations to a classroom of students, each of whom
is at a different level of competence. The computer selects the
appropriate items and the point at which to discontinue testing for a given
diagnosis, thus reducing the amount of time that the student or the
teacher needs to devote to classroom testing.

Moreover, test security can be enhanced. Since passwords and
encryptions are employed, no paper version of an assessment need exist.
Test items can be sequenced randomly among computers to reduce
the chance for students at adjacent computer stations to cheat. A final
advantage is the wide range of stimuli that can be employed in the
computerized presentation, including audio and video material.
Videodiscs can store up to 54,000 still images or 30 minutes of full-motion,
color video images (Blando and Ryan 1992).
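
Random sequencing of the same items across stations can be sketched as follows. This is an assumed scheme, not a description of any actual product: the `sequence_for_station` helper, the `form_seed` parameter, and the way the seed is combined with the station number are all hypothetical.

```python
import random


def sequence_for_station(item_ids, station_number, form_seed=1994):
    """Return the item order for one station: same items, shuffled order.

    Seeding from the form and station makes the order reproducible, so a
    station's sequence can be regenerated later when responses are scored.
    """
    rng = random.Random(form_seed * 1000 + station_number)
    order = list(item_ids)  # copy so the master list is left untouched
    rng.shuffle(order)
    return order


ITEM_IDS = ["Q1", "Q2", "Q3", "Q4", "Q5", "Q6", "Q7", "Q8"]

for station in (1, 2, 3):
    print(station, sequence_for_station(ITEM_IDS, station))
```

Every station receives the full set of items, so scores remain comparable, while neighboring screens are unlikely to show the same question at the same moment.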




Adaptive Testing 

In computerized adaptive testing, the computer uses a student's
previous answers to select subsequent items that are most suitable in
terms of optimal measurement, motivation, and time savings.
Computerized adaptive testing can cut testing time in half because fewer
items are required for reliable assessment.
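
The selection logic described above can be sketched in a few lines. This is a deliberately simplified staircase procedure, not the item response theory models that operational adaptive tests rely on; the difficulty scale, the step-halving rule, the stopping threshold, and the simulated student are all assumptions made for illustration.

```python
def run_adaptive_test(difficulties, answers_correctly,
                      start=0.0, step=1.0, min_step=0.2):
    """Administer items one at a time, choosing each from previous answers.

    difficulties: dict mapping item_id to difficulty on an arbitrary scale.
    answers_correctly: callable(item_id) -> bool, standing in for the student.
    Returns (ability_estimate, list_of_administered_item_ids).
    """
    estimate = start
    remaining = dict(difficulties)
    administered = []
    while remaining and step >= min_step:
        # Choose the unused item whose difficulty is closest to the estimate.
        item_id = min(remaining, key=lambda i: abs(remaining[i] - estimate))
        del remaining[item_id]
        administered.append(item_id)
        # Raise the estimate after a correct answer, lower it after an
        # incorrect one, then halve the step so the estimate settles.
        estimate += step if answers_correctly(item_id) else -step
        step /= 2
    return estimate, administered


ITEMS = {"a": -2.0, "b": -1.0, "c": 0.0, "d": 1.0, "e": 2.0, "f": 1.5}
TRUE_ABILITY = 1.2  # hypothetical student: correct on items easier than this

estimate, used = run_adaptive_test(ITEMS, lambda i: ITEMS[i] < TRUE_ABILITY)
print(estimate, used)  # prints 1.25 ['c', 'd', 'f']
```

In this run only three of the six items are administered, which illustrates why an adaptive test can be shorter than a fixed form: items far from the student's level are never presented.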

Adaptive testing can be used for diagnosis and instructional
feedback; selection, placement, and certification; and accountability or
system monitoring. For example, the Portland Achievement Level
Testing program is a combined norm- and criterion-referenced
battery employing computerized adaptive testing. The testing program
serves three purposes: 1) to test students when they enter the district
in order to place them in appropriate instructional programs, 2) to
provide continuous assessment of the students throughout the school
year, and 3) to select students for placement in special programs at
any point during their enrollment. In addition, a version of the
computerized adaptive test has been used for such accountability
functions as the evaluation of compensatory education programs (U.S.
Congress, Office of Technology Assessment 1992).

Integrated Learning Systems 

Four decades ago, Ralph Tyler pointed out that "Measurement
[should be] conceived, not as a process quite apart from instruction,
but rather as an integral part of it" (1951, p. 47). Integrated learning
systems (ILS) are computer systems that permit an individual student's
test results to guide instruction. Because computers can store large
numbers of items and rapidly calculate estimates of a student's ability
following the administration of each item, they permit shorter tests that
are individually suited to the student and provide nearly instantaneous
feedback, which also may enhance motivation and utility.

ILS technology is guided by two aspects of curriculum. One is
instructional experiences that move students through the domain of
content to accomplish educational goals. The other is the set of course
standards that serve as milestones of beginning, intermediate, or
terminal accomplishments. ILS make use of instructional activities that
move students along a path of expertise marked by testing milestones.
Thus ILS are able to provide continuous analysis, diagnosis, and
monitoring of student learning.

ILS items can be presented as part of the instructional process. The
successive screens displayed on the computer provide presentations,
checks for understanding, practice, coaching, and feedback. At the
request of the student or teacher, a "mastery map" can be displayed
on the monitor that shows what the student has accomplished and what
standards are yet to be completed. The standards can reflect student,
teacher, or district goals. ILS may include the following features:
displays of student progress and options, directed practice on tasks,
on-line prototype answers to assessments, cumulative archives of
individual student data, computer-guided coaching, predictions of
learning rates to guide review, and minute-by-minute analyses of classroom
and group performance to aid in classroom, school, and district
instructional management. Over the past 30 years, ILS developed by
vendors such as WICAT, Computer Curriculum Corporation, and
Jostens have been implemented in a variety of school districts.

Conventional computer-assisted instruction programs, an early
application of ILS, have been studied carefully. Research syntheses
show that their effects on specific, short-term learning outcomes are
greater than those of conventional teaching (Niemiec and Walberg
1987). However, the data on more advanced ILS applications using
general, long-term educational outcomes have been less convincing.

Intelligent Measurement

The use of knowledge bases and inferencing procedures permits
computer systems to produce "intelligent measurement" (Bunderson,
Inouye, and Olsen 1989). Intelligent measurement requires a
knowledge base that contains expertise specific to a subject area. Three types
of intelligent measurement are applicable to education. The first type
provides prescriptive advice, or intelligent interpretations. An expert
system can model the knowledge of a teacher who is familiar with
the subject area, the instructional management system, and the
curriculum. The expert system can model good pedagogy, match
instruction with characteristics of learners, and generate trajectories of student
progress.

A second educational application is automatic holistic scoring.
Knowledge bases used in automatic holistic scoring represent
mastery standards for the assessment tasks and the scoring knowledge
of experts. In automatic holistic scoring, an expert system performs
the complex scoring of assessments, replicating the judgments made
by human scorers.

A third application of intelligent measurement is the automation
of individual profile interpretations. The output of the computer
includes: 1) questions for counselors to ask students in order to clarify
the students' performance and 2) interpretative commentaries. The
automation of individual profile interpretations reduces the need for
each teacher to be an expert in interpreting assessment results.

Adequacy of Current Assessments



As assessment has increased in importance, the technical merit of
individual assessments has become more critical. Technical merit may
be evaluated using, for example, the Standards for Educational and
Psychological Tests (1985) or the ETS Standards for Quality and
Fairness (1987). Because there is no consensus about what constitutes
appropriate technical standards for alternative assessments, determining
their merit is more difficult than determining the merit of traditional
assessments. For example, how does one determine the merit of a
geometry assessment that promotes enthusiasm by allowing students
to use their artistic talents in answering questions but that does not
meet traditional technical standards of validity and reliability? Three
areas of debate have emerged: purposes of assessment, standards of
technical quality, and costs.

Purposes of Assessment 

Assessments should fulfill at least one of three purposes: 1) the
monitoring of individual student progress and the diagnosis of
learning difficulties, 2) the placement and certification of individual
students, and 3) the evaluation and comparison of groups to ensure the
accountability of the education system (Resnick and Resnick 1990).

Monitoring and Diagnosis of Student Learning. Monitoring
individual student progress and diagnosing learning difficulties aid teachers
in classroom management. Monitoring student progress usually is
accomplished with teacher-made tests that help teachers determine
instruction. Some teachers criticize standardized achievement tests
as lacking the capacity to provide diagnostic information for the
enhancement of day-to-day instruction.

Traditional tests have been criticized for their dependence on
recognition items, their limited coverage of domains of knowledge, and
their alleged failure to elicit a range of higher-order thought processes.
Traditional multiple-choice achievement tests, it is argued, place too
much emphasis on facts and procedures for solving well-structured
problems that are presented without context. In addition, these
traditional tests are limited in their utility to identify the characteristics
of a student's learning. Of course, traditional tests were never intended
to do all these things. They were intended as complements to such
teacher assessments as essays and laboratory assignments, which can
accomplish these things.

Critics of multiple-choice items argue that such items are easier
because they require only that the student recognize a correct answer.
This is in contrast to fill-in-the-blank or essay items, which require
a student to recall appropriate information and to generate a response.
In addition, multiple-choice items sometimes can be answered correctly
by guessing.

According to Ward, Rock, and La Hart ( I W)), traditional and
alternative item formats can be arranged along a continuum based on
several dimensions: 1) selection/identification,
2) reordering/rearrangement, 3) substitution/correction, 4) completion,
5) construction, and 6) presentation/performance. For example,
multiple-choice items are considered the most constrained, because they
do not require a student to generate a response. Rather, the student
selects an answer from those presented. Proponents of alternative
assessments assert that the items used in alternative assessments are
less constrained and can be used to measure more realistic and complex
problem-solving than can multiple-choice test items. An example of such
an alternative is shown in Figure 1.

Draw a line connecting the sun, the cat, the plants, and the mice to
show the direction in which energy travels through the food web.

Figure 1. A less-constrained item from an alternative assessment.

Proponents of performance-based alternative assessments emphasize
the complexity and authenticity of less-constrained items. They
argue that performance assessments derive their value from examining
actual performances, rather than from examining indicators of
potential performances, as occurs with traditional tests. Following is
a form used to score a complex performance-based assessment that
currently is being used on a small scale. In this assessment, "Students
used a laboratory setup to determine which of three paper towels held
the most and least water" (Shavelson and Baxter 1992, p. 21).

Paper Towels Investigation: Hands-on Score Form*

1. Method
   A. Container                      B. Tray (surface)
      Pour water in/put towel in        Towel on tray/pour water on
      Put towel in/pour water in        Pour water on tray/wipe up

2. Saturation          A. Yes    B. No

3. Determine Result
   A. Weigh towel
   B. Squeeze towel/measure water (weight or volume)
   C. Measure water in/out
   D. Time to soak up water
   E. No measurement
   F. Count # drops until saturated
   G. See how far drops spread out
   H. Other

4. Care in Measuring   Yes    No

5. Correct Result      Yes    No

*Adapted from Shavelson, R.J., and Baxter, G.P. "What We've Learned
About Assessing Hands-on Science." Educational Leadership 49, no. 8
(1992): 21.

Many researchers and educators are wary of claims that such al- 
ternative assessments can replace multiple-choice tests for monitoring 
student learning. Many valued educational tasks require simple recog- 
nition, and some skills (including higher-order thinking) and subject 
areas can be competently assessed using a multiple-choice format. 
Figure 2 shows a multiple-choice item that tests higher thought 
processes. 

You are building a staircase out of cubes:

   1 step  = 1 cube
   2 steps = 3 cubes
   3 steps = 6 cubes

How many cubes does it take to build a staircase that is 6 steps high?

   a. 36 cubes
   b. 28 cubes
   c. 21 cubes
   d. 15 cubes

From A Sampler of Mathematics Assessment, California Department of
Education, p. 40. Reprinted with permission.

Figure 2. An example of an enhanced multiple-choice item that measures
higher-order thought processes.
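The reasoning the staircase item in Figure 2 calls for can be sketched in a few lines of Python. This is only an illustration, assuming each new step adds a column one cube taller than the last, which is the reading consistent with the three cases given (1, 3, and 6 cubes). The function names are hypothetical.

```python
def cubes_by_counting(steps: int) -> int:
    """Add up column heights 1, 2, ..., steps."""
    return sum(range(1, steps + 1))


def cubes_by_formula(steps: int) -> int:
    """Closed form for the same sum: n(n + 1) / 2."""
    return steps * (steps + 1) // 2


# Reproduce the three cases given in the item, then the one it asks about.
for n in (1, 2, 3, 6):
    print(n, "steps =", cubes_by_counting(n), "cubes")
```

Under this reading a 6-step staircase takes 1 + 2 + 3 + 4 + 5 + 6 = 21 cubes, which is option c. A student who simply doubles the 3-step total, or squares the number of steps, lands on a different figure, so the item rewards finding the pattern rather than recalling a fact.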





Some advocates argue that alternative assessments provide a
multidimensional view of a particular skill or content area. Yet breadth
of coverage often is traded for depth of coverage. Performance
assessments are based on one or a small number of tasks and thus may
assess only a limited sample of what a student knows compared to
the dozens of facts and ideas that can be assessed using
multiple-choice items.

Researchers have examined the equivalence of multiple-choice and
alternative assessments in various subject areas. In this research, the
knowledge measured and the scores assigned using multiple-choice
and alternative items (in particular, open-ended items) are very
similar. If the knowledge and scores assigned are the same, then the
capacity of these two types of assessments to diagnose learning
difficulties is equivalent. Thus alternative assessments, which may
be costly and may lack technical standards, have yet to demonstrate
more value than teacher-made and traditional multiple-choice tests.

Certification and Placement of Students. A second purpose of testing
is to make decisions about placement in instructional programs
and certification of mastery of a content area. Tests used for this
purpose do not inform the management of daily classroom activities but
are used to make administrative decisions about a student's progress
through the school system.

Tests used for making high-stakes decisions, including placement
and certification, must meet high technical standards that warrant their
use as the single piece of evidence for making decisions about a
student's future. Of course, better decisions are made when multiple
sources of evidence are used.

The use of traditional and alternative assessments for the placement
and certification of a diverse student body has been challenged with
evidence of racial, ethnic, and gender differences in performance on
these tests. Some critics contend that traditional tests are inherently
biased and produce an adverse impact on some groups of students,
and thus should play a minor role, if any, in high-stakes decisions.
But cautions regarding technical quality, equity, bias, and adverse
impact pertain to all types of tests and assessments used for high-stakes
decisions, including alternative assessments.

Some researchers indicate that some alternative forms of assessment,
at least initially, widen the performance gap between males and
females and between socioeconomic and ethnic groups. In contrast,
others cite evidence that the written essay part of advanced placement
exams in various subject areas produces smaller gender differences
than do the multiple-choice parts of these exams. Less information
is available about the reliability and validity of alternative forms of
assessment. Again, there is the question of proof. Simply criticizing
multiple-choice items or misuse of traditional tests hardly makes an
affirmative case for alternative assessments.

Comparison of Groups for Accountability. The third purpose of
assessment is the comparison of groups of students in order to evaluate
schools, programs, states, and nations, and thus maintain
accountability. Typically, multiple-choice tests have been used for this
purpose. Recently, alternative assessments using constructed-response
items have been considered for use in large-scale testing programs.

Improved accountability requires not only accurate comparisons
among groups but also measures of student performance on content
in which students have received instruction. School district goals,
instructional materials, methods, curricula, and assessments often are
poorly aligned or integrated. School districts develop local goals and
curricula but depend on commercial textbooks and standardized tests
for instruction and assessment. Because these commercial textbooks
and tests are prepared for national use, they rarely reflect all the
local educational priorities.

This mismatch may lead to inefficiencies and morale problems.
Teachers may use instructional materials that fail to reflect district
goals, even though they will be evaluated in part on their students'
attainment of those goals. Students may be examined on knowledge
and skills they have not studied, or they may study content that is
not considered a priority in their community but on which they will
be tested. These mismatches have been identified as a cause of poor
educational productivity in the United States. Poor alignment of
instruction, materials, and tests cannot be eliminated simply by using
alternative assessments, although the local development of assessments
can ensure more agreement among the elements of instruction and
their assessment. However, local assessments cannot serve the
purposes of comparing districts, states, or nations.

The National Assessment of Educational Progress (NAEP), a
congressionally initiated survey of educational achievement, is a testing
program used for monitoring student performance at the national and
state level. Since 1969, NAEP has collected assessment data in
reading, mathematics, science, writing, history/geography, and other
fields. NAEP draws on a representative sample of schools that
participate in the assessments. In 1990, for the first time, NAEP
conducted state-by-state comparisons as part of the mathematics
assessment. Advocates of NAEP believe that these comparisons hold
educators accountable for their students' performance over time and
also for the level of performance their students display in comparison
to similar students nationwide.

Recently, both major political parties advocated the establishment
of a national examination system to monitor the nation's schools.
Referred to as America 2000 during the Bush Administration and Goals
2000 in the Clinton Administration, this examination system proposes
"world-class standards" in English, mathematics, history, science, and
geography (U.S. Department of Education 1991).

Standards of Technical Quality

Reliability and validity are two key psychometric concepts that are
applied to the evaluation of tests and assessments.

Reliability is the consistency with which a test measures content.
One type estimates the consistency with which the test measures the same
individual's performance on several occasions (test-retest reliability).
Other types of reliability establish the equivalence of several forms
of a test (parallel and alternate-forms reliability). Still other types of
reliability establish whether each item on a test measures the same
content (split-half or alpha reliability). Finally, inter-rater reliability
estimates the consistency with which raters assign scores to an
individual's performance.
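Alpha reliability, mentioned above, can be computed directly from a table of item scores. The sketch below assumes the usual Cronbach's alpha formula, alpha = k/(k-1) * (1 - sum of item variances / variance of total scores), where k is the number of items. The function name and the score table are hypothetical, made up only to show the calculation.

```python
from statistics import pvariance


def cronbach_alpha(scores):
    """scores: one row per student, one column per item (numeric)."""
    k = len(scores[0])                      # number of items
    item_columns = list(zip(*scores))       # transpose rows into item columns
    sum_item_var = sum(pvariance(col) for col in item_columns)
    total_var = pvariance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum_item_var / total_var)


# Hypothetical data: 5 students, 4 items scored right (1) or wrong (0).
example = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 0, 1, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
]
print(round(cronbach_alpha(example), 3))
```

Values near 1 suggest the items rank students consistently; values near 0 suggest the items do not hang together. Because alpha is a ratio of variances, it comes out the same whether population or sample variance is used, as long as the choice is consistent.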

The reliability of traditional multiple-choice tests is well
documented. In contrast, little information is available on the
reliability of many alternative assessments. The reliability information
that is available tends to focus on inter-rater reliability of
performance-based alternative assessments. Preliminary evidence shows
that expert raters often lack consensus on their assessments of written
essays, laboratory exercises, and other alternative tasks.
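One common way to put a number on rater consensus is to compare simple percent agreement with Cohen's kappa, which discounts the agreement two raters would reach by chance. The text does not say which statistic the cited studies used, so the sketch below is an assumption on that point, and the ratings in it are invented for illustration.

```python
from collections import Counter


def percent_agreement(rater1, rater2):
    """Share of performances on which the two raters gave the same score."""
    return sum(a == b for a, b in zip(rater1, rater2)) / len(rater1)


def cohens_kappa(rater1, rater2):
    """Agreement corrected for chance: (p_o - p_e) / (1 - p_e)."""
    n = len(rater1)
    p_observed = percent_agreement(rater1, rater2)
    counts1, counts2 = Counter(rater1), Counter(rater2)
    categories = set(rater1) | set(rater2)
    # Chance agreement: product of each rater's marginal proportions.
    p_expected = sum(counts1[c] * counts2[c] for c in categories) / n ** 2
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical essay scores (1-4 scale) from two raters on eight essays.
rater_a = [4, 3, 3, 2, 1, 4, 2, 3]
rater_b = [4, 3, 2, 2, 1, 3, 2, 4]
print(percent_agreement(rater_a, rater_b))
print(round(cohens_kappa(rater_a, rater_b), 3))
```

Percent agreement alone can look respectable when raters use only a few score points, which is why a chance-corrected figure is often reported alongside it.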

Vermont initiated the first statewide assessment to measure student
achievement using portfolios, one of the more popular forms of
alternative assessment. This statewide assessment is one of the few
alternative assessment projects being dispassionately evaluated by an
external evaluator. A recent article describing this assessment (Viadero
I W) revealed the difficulty of establishing adequate reliability when
using alternative test formats: "A 1992 report by the RAND
Corporation . . . finds that the 'rater reliability' in scoring the
portfolios . . . was very low" (p. 18). Because of low rater reliability,
the results were not reported at the school or district level.

Validity refers to whether a test measures what it is claimed to
measure. Recently, psychometricians Linn, Baker, and Dunbar (1991)
developed an expanded definition of validity that applies to alternative
and traditional assessments. In their view, the evaluation of validity
for all forms of assessment should, at the minimum, include evidence
regarding directness and transparency, consequences, fairness,
transfer and generalizability, cognitive complexity, content quality,
content coverage, and meaningfulness.

Directness refers to the extent to which the assessment task matches
the instructional goals. An example of the direct assessment of writing
is when students are asked to produce a writing sample. In contrast,
an indirect assessment of writing skill requires students to answer
multiple-choice questions about correct punctuation or stylistic
considerations. Transparency refers to the clarity of the criteria used
in judging performances. Assessments with high transparency have
high acceptability and are viewed as legitimate measures. Directness
and transparency can be viewed as components of face validity.

Fairness requires the identification of potential sources of bias, such
as rater effects or insensitive or irrelevant materials. Bias sometimes
can be detected using statistics that identify items on which groups
of test takers perform differently. These differences may not reflect
true differences in test takers' knowledge but, rather, differences in
their cultural experiences. Adverse impact on identified groups of
students must be considered when judging the fairness of an assessment.
An assessment should not result in members of any racial, gender,
or ethnic group being evaluated differentially, assuming all groups
of students are equally qualified. In addition, some students may need
to be taught how to take tests more effectively in order to ensure a
fair evaluation of their knowledge.
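The text does not name the statistics used to flag items on which groups perform differently. One widely used choice is the Mantel-Haenszel common odds ratio, which compares two groups' odds of answering an item correctly after matching test takers on total score. The sketch below assumes that technique; the group labels and all counts are hypothetical.

```python
def mantel_haenszel_odds_ratio(strata):
    """Common odds ratio across score-matched strata.

    strata: iterable of tuples
        (ref_correct, ref_wrong, focal_correct, focal_wrong),
    one tuple per total-score band.
    """
    numerator = 0.0
    denominator = 0.0
    for ref_right, ref_wrong, focal_right, focal_wrong in strata:
        n = ref_right + ref_wrong + focal_right + focal_wrong
        if n == 0:
            continue  # skip empty score bands
        numerator += ref_right * focal_wrong / n
        denominator += ref_wrong * focal_right / n
    return numerator / denominator


# Hypothetical counts for one item in three total-score bands.
item_counts = [
    (20, 30, 15, 35),   # low scorers
    (40, 20, 30, 30),   # middle scorers
    (45, 5, 40, 10),    # high scorers
]
print(round(mantel_haenszel_odds_ratio(item_counts), 2))
```

A ratio near 1 means that equally able members of the two groups have similar odds of success on the item. A ratio well above or below 1 marks the item for review by content experts; the statistic alone does not show why the groups differ.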

Critics of traditional assessments question the degree to which
successful performance on traditional assessments transfers to real-world
activities. Performance assessments are believed to have increased
transferability to nonacademic tasks.

However, performance assessments present special problems in
terms of their generalizability. Because developers of performance
assessments create tasks that are held to be realistic, complex, and
contextualized, the assessment tasks require more time than traditional
tests. As a result, fewer tasks can be administered. Thus such
assessments provide fewer instances of student behavior and a limited
sample of student knowledge and skills.

When an assessment requires the test taker to use several abilities,
as opposed to a simple, less developmentally advanced way to solve
problems, it is considered cognitively complex. Complexity should
be determined by analyzing the types of skills and processes students
use to answer questions. Students can correctly answer items using
processes and strategies other than those expected by the test
developers. Thus items that were designed to assess students' higher
thought processes may be solved using less-advanced approaches, or
vice versa.

Judging the quality of an assessment should include a review of 
its content. Adequate content coverage should express the breadth and 
depth of the subject. Subject matter experts should systematically de- 
termine whether the assessment adequately covers current ideas and 
material of long-standing importance. This type of review is particu- 
larly important in the case of performance assessments that sample 
only a limited aspect of a subject area. 

Whether students and teachers perceive assessment problems as
meaningful affects their motivation and performance. When assessments
are meaningful to students, their content is relevant to the
students' experiences. Advocates of performance assessment believe that
assessment can be meaningful learning. By this criterion, however,
life in classrooms can never be as "authentic" as that outside.

Costs of Assessments 

Beyond consideration of the money spent on all types of
assessments, educators increasingly are concerned about the time students
spend preparing for and taking tests and the time teachers spend
preparing, administering, and scoring tests.

Student Time Spent on Testing. Based on a national survey,
researchers Dorre-Bremme and Herman (1986) concluded that only modest
amounts of student time are devoted to testing. At the elementary level,
total testing time, in all subjects, averaged 76 hours a year, or 8.6%
of the total class time of students. Elementary students took a test
in reading and a test in math about once every eight days. High school
students spent about 12% of their time taking tests in English and
mathematics classes. A typical 10th-grader spent nearly hours
annually completing tests in English and 24 hours annually completing
tests in mathematics. A high school student took an English test
and a mathematics test every three or four days.

Dorre-Bremme and Herman also found that both high school and
elementary students spent the largest percentage of their testing time
on teacher-developed tests and the next-largest percentage on tests
included with curriculum materials. In contrast, minimum competency
testing, on average, consumed a very small percentage of testing time.
State- and district-mandated tests took about 25% of high school
students' total testing time.

Teacher Time Spent on Testing. According to Dorre-Bremme and
Herman, for each hour a student spent taking a test, a teacher spent
two to three hours preparing for the test, grading the test, and
recording students' scores. Interviews with elementary teachers indicated
that they spent about 12% to 15% of their work time, both in and
out of school, on achievement testing in all subject areas. This
averages to about 200 to 250 hours throughout a school year. Similar
figures were not available for high school teachers, but the
researchers claimed that high school teachers spent about two hours
outside the class for every hour of student testing.
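The teacher-time figures above can be cross-checked with simple arithmetic. The sketch below is an illustration only. It takes the percentages and hour totals from the text, and the derived annual work time is an inference from those figures, not a number reported by the researchers. The function names are hypothetical.

```python
def implied_work_hours(testing_hours: float, share_of_work_time: float) -> float:
    """Total annual work hours implied by a testing total and its share."""
    return testing_hours / share_of_work_time


def teacher_hours(student_test_hours: float, hours_per_test_hour: float) -> float:
    """Teacher time generated by a given amount of student testing."""
    return student_test_hours * hours_per_test_hour


# 200 hours at 12% and 250 hours at 15% should imply the same work year.
low = implied_work_hours(200, 0.12)
high = implied_work_hours(250, 0.15)
print(round(low), round(high))

# 24 hours of 10th-grade mathematics testing at two outside hours per hour.
print(teacher_hours(24, 2))
```

Both ends of the range imply a work year of roughly 1,670 hours, so the percentages and the hour totals are consistent with each other.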

The amount of time teachers devote to alternative assessments also
has been a subject of debate but has not been well researched.
According to one report, teachers in Great Britain, who have heavily
relied on alternative assessments in the past few years, are displeased
with the time commitment that such tests require.

Although alternative assessments may require more teacher time
for development of the assessments and the training in their use, such
time can be viewed as a benefit rather than a cost. For example,
teachers involved in developing and scoring the California Assessment
Program report that these processes are the most effective staff
development activity in which they have participated (Carlson 1991).

Costs. The three basic costs incurred when conducting traditional
or alternative assessments are: 1) money costs, 2) non-money costs,
and 3) estimated opportunity costs. Money costs are the dollars spent
on development, administration, scoring, and reporting results.
Although estimates vary on the exact money costs of traditional tests
and performance assessments, experts estimate performance items to
be much more expensive. According to the U.S. Congress, Office
of Technology Assessment:

The costs of performance assessment represent a substantial barrier
to expanded use. Performance assessment is a labor-intensive and
therefore costly alternative unless it is integrated in the instructional
process. Essays and other performance tasks may cost less to develop than
do multiple-choice items, but are very costly to score. One estimate
puts scoring a writing assessment as 5 to 10 times more expensive as
scoring a multiple-choice examination, while another estimate based
on a review of several testing programs administered by ETS . . .
suggests that the cost of assessment via one 20- to 40-minute essay is
between 3 to 5 times higher than assessment by means of a test of 150
to 200 machine-scored, multiple-choice items. Among the factors that
influence scoring costs are the length of time students are given to
complete the essay, the number of readers scoring each essay,
qualifications and location of readers (which affect how much they are paid,
and travel and lodging costs for the scoring process), and the amount
of pretesting conducted on each prompt or question. The higher these
factors, the higher the ratio of essay to multiple-choice costs. (1992,
p :-M)
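The factors listed in the passage above can be combined into a rough cost model. The sketch below is an assumption-laden illustration, not a method taken from the Office of Technology Assessment report. The function, its parameters, and every dollar figure are hypothetical, chosen only to show how the number of readers and the reading time drive the total.

```python
def essay_scoring_cost(students, readers_per_essay, minutes_per_reading,
                       hourly_pay, fixed_costs=0.0):
    """Reader pay for all readings plus fixed costs such as travel and lodging."""
    reading_hours = students * readers_per_essay * minutes_per_reading / 60
    return reading_hours * hourly_pay + fixed_costs


# Hypothetical program: 1,000 essays, 3 minutes per reading, $20 per hour,
# $500 in travel and lodging for the scoring session.
two_readers = essay_scoring_cost(1000, 2, 3, 20, fixed_costs=500)
four_readers = essay_scoring_cost(1000, 4, 3, 20, fixed_costs=500)
print(two_readers, four_readers)
```

In this sketch, doubling the number of readers per essay doubles the reader-pay portion of the bill while the fixed costs stay the same. That illustrates the passage's point that higher values of these factors raise the ratio of essay to multiple-choice scoring costs.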

Non-money costs for traditional and alternative assessments include
expenditures of employee time, materials, equipment, space, and
energy. Other non-money costs may be stress and a decrease in morale
for students, teachers, and administrators. The enthusiasm produced
by some of the hands-on activities used in alternative assessments must
be weighed against the expenditures of time required to administer
and score performance assessments. Traditional tests often are met
with less enthusiasm by teachers and students but require a more
modest expenditure of time, materials, and space.




Opportunity costs require educators to consider what is displaced 
for students and teachers when a testing program is implemented. 
When resources of time, money, and energy are invested in an as- 
sessment program, they are unavailable for other uses. For example, 
the time spent by teachers on administering and scoring assessments 
should be weighed against the time that could have been used for lesson
planning, tailoring instruction to individual students, or upgrading
teachers' content knowledge and pedagogical skills.

A second example of opportunity costs is provided by comparing 
the costs and benefits of using different types of assessments. Some 
educators argue that alternative assessments provide better data for 
diagnosing and remediating learning difficulties than do traditional 
tests. However, the opportunity costs of improved diagnostic
information may include a loss of instructional time for students or
planning time for teachers, and a reduction in the budget due to the
expense of the alternative assessment. Alternative assessments may
require more resources on the part of the education system than their
promised benefits warrant.

Conclusion

Assessment is integral to the educational process. It serves three
fundamental purposes: the day-to-day management of instruction, the
classification and placement of students, and the maintenance of
accountability for educators and students. Because of these
fundamental uses, assessment has become a primary tool for the reform
of education. In the past decade, educators have argued over the
purposes, format, technical adequacy, and costs of assessment. New
assessments are emerging from these debates. Some employ new item
formats; others make use of computer-based technologies.

Psychological research is another influence on assessment.
Psychologists have argued for assessments that measure students' knowledge
schemas, pathways to expertise, and metacognitive learning and study
strategies. However, leading education researchers have cautioned
against hasty applications of cognitive psychology. Richard Snow and
David Lohman assert, "Cognitive psychology has no ready answers
for the measurement problems of yesterday, today, or tomorrow"
(1989, p. 320).

Alternative assessments, particularly those that are referred to as
"authentic" because of their reliance on complex, real-life tasks, are
viewed by some as a remedy for the misuse of traditional testing.
Alternative assessments are regarded as having high face validity and
close curricular and test alignment. Other advocates of alternative
assessments see these tests as the best way to measure subject matter
expertise. They believe that expertise is better demonstrated in
assessments that require extended performances and go beyond
recognition items. Despite the supposed benefits attached to alternative
assessments, there is little evidence of their wide-scale feasibility,
practicality, and utility.

When the purpose of assessment is monitoring the educational
standing of school districts, then traditional tests may be the assessment
method of choice. Standardization and norming are necessary when
comparisons among groups of students are to be made. The larger
the pools of students being compared, the more important it is that
the assessment procedure be affordable, objective, standardized, and
easy to administer and score. These criteria are not easily met by many
alternative assessments. Multiple-choice tests can serve the purpose
of accountability and, with enhancement, can measure higher thought
processes. When selecting an assessment, educators must be
attentive to the trade-offs in cognitive sensitivity, technical adequacy,
costs, and ability to fulfill the assessment purposes.

Alternative assessments promise much, but they require sober
evaluation. There is little information about the technical characteristics
of many new forms of assessment. Evidence of difficulties in the use
of performance assessment, one form of alternative assessment,
has surfaced. For example, Alan Purves, director of the international
writing assessment, recently expressed disenchantment over the
inability to establish comparable ratings among judges (Rothman).
This problem plagues not only writing assessments, but
performance assessments of other subjects as well.

Studies by psychologist Richard Shavelson (1991) also cast doubt
on the viability of a performance assessment as a sole tool for
assigning grades to students. Based on his project, which develops science
and mathematics performance assessments, he reported that large
differences in a student's scores can occur depending on which
performance assessment task is administered. In other words, different
performance assessments that attempt to measure the same content
do not rank students in the same order.
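Whether two tasks rank students in the same order can be checked with a rank correlation. The sketch below uses Spearman's rho, which is an assumption here, since the text does not say which statistic Shavelson reported. The scores are invented, and the simple formula used assumes no tied scores within a task.

```python
def ranks(scores):
    """Rank 1 goes to the highest score; assumes no ties."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    result = [0] * len(scores)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman_rho(task_a, task_b):
    """1 - 6 * sum(d^2) / (n * (n^2 - 1)), where d is the rank difference."""
    n = len(task_a)
    ranks_a, ranks_b = ranks(task_a), ranks(task_b)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))
    return 1 - 6 * sum_d2 / (n * (n * n - 1))


# Hypothetical scores for six students on two tasks covering the same content.
task_one = [18, 15, 12, 10, 7, 4]
task_two = [11, 17, 6, 14, 9, 3]
print(round(spearman_rho(task_one, task_two), 3))
```

A value of 1 would mean the two tasks order the students identically. Values well below 1 are the pattern described above, in which a student's standing depends on which task happened to be administered.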

Computer-based technology promises to make assessment efficient
and has demonstrated some impressive results. Optical mark-reading
equipment, mark-sense answer sheets, microcomputers, hypermedia,
artificial intelligence, and other applications have advanced
assessment practices. Technologists claim that computers will be able to
score complex, constructed responses; maintain cumulative mastery
maps of student progress; present multimedia simulations; videotape
student performances for further analysis; and train teachers on the
administration, scoring, and interpretation of assessment results. Many
of these components have been demonstrated separately. What is
lacking, as yet, are large-scale systems that integrate a substantial part
of the K-12 curriculum and instructional programs.

In the past, several computer-based innovations have been heralded
as cure-alls. Although the feasibility of such technologies was
demonstrated in university laboratories and military and business
environments, their effectiveness in school settings was less well-documented.
Presumably, with national, or at least widely shared, goals that value
technology for education, new technologies may become feasible for
the nation's schools. Nevertheless, it is unlikely that computer-based
technologies will be a panacea to our assessment ills in the
immediate future. As in the case of all forms of assessment, open-mindedness
and healthy skepticism are in order.